London Bikes data

We’ll use data from http://www.tfl.gov.uk to analyse usage of the London Bike Sharing scheme. This data has already been downloaded for you and exists in a CSV (Comma Separated Values) file that you have to read in to R.

There is no dropdown menu to read in your data to R, so instead we use functions like read_csv to load data from file into R objects.

#read the CSV file
bike <- read_csv(here::here("data", "london_bikes.csv")) 

Cleaning our data

Sometimes our data needs a bit of ‘cleaning’. For instance, day_of_week is variable type character, or chr. We should, however, treat it as a categorical, or factor variable and relevel it, so Monday is the first level of the factor (or first day of week), etc.

R is fairly sensitive with dates. When you read a CSV file, the date may be in different formats. For instance, Christmas 2017 could be input as 12-25-2017, 25.12.2017, 25 Dec 2017, Dec 25, 2017, etc. To be consistent, we use lubridate’s ymd function, and force variable Day to be a date in the format YYYY-MM-DD

Finally, we can turn season from 1, 2, 3, 4, to words like Winter, Spring, etc.

We will be talking more about data wrangling later, but for now just execute the following piece of code.

# fix dates using lubridate, and generate new variables for year, month, month_name, day, and day_of _week
bike <- bike %>%   
  mutate(
    year=year(date),
    month = month(date),
    month_name=month(date, label = TRUE),
    day_of_week = wday(date, label = TRUE),
    wday = as.character.factor(day_of_week),
    weekend = if_else((day_of_week == "Sat" | day_of_week == "Sun"), TRUE, FALSE)) 

# generate new variable season_name to turn seasons from numbers to Winter, Spring, etc
bike <- bike %>%  
  mutate(
    season_name = case_when(
      month_name %in%  c("Dec", "Jan", "Feb")  ~ "Winter",
      month_name %in%  c("Mar", "Apr", "May")  ~ "Spring",
      month_name %in%  c("Jun", "Jul", "Aug")  ~ "Summer",
      month_name %in%  c("Sep", "Oct", "Nov")  ~ "Autumn",
    ),
    season_name = factor(season_name, 
                         levels = c("Winter", "Spring", "Summer", "Autumn")))
    
#examine the structure of the datafame
skim(bike)
Data summary
Name bike
Number of rows 5084
Number of columns 40
_______________________
Column type frequency:
character 6
factor 3
logical 2
numeric 26
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
preciptype 1346 0.74 4 9 0 3 0
conditions 0 1.00 4 28 0 11 0
description 0 1.00 26 85 0 78 0
icon 0 1.00 4 17 0 6 0
set 0 1.00 5 5 0 1 0
wday 0 1.00 3 3 0 7 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
month_name 0 1 TRUE 12 Jan: 434, Mar: 434, May: 434, Aug: 434
day_of_week 0 1 TRUE 7 Fri: 727, Sat: 727, Sun: 726, Mon: 726
season_name 0 1 FALSE 4 Spr: 1288, Aut: 1274, Win: 1264, Sum: 1258

Variable type: logical

skim_variable n_missing complete_rate mean count
severerisk 5084 0 NaN :
weekend 0 1 0.29 FAL: 3631, TRU: 1453

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bikes_hired 0 1.00 26370.17 9593.67 0.0 19616.00 26047.50 32985.75 73094.00 ▂▇▆▁▁
tempmax 0 1.00 14.64 6.36 -2.6 10.00 14.30 19.40 36.10 ▁▇▇▃▁
tempmin 0 1.00 7.73 5.02 -11.1 4.00 7.80 11.50 22.10 ▁▃▇▇▁
temp 0 1.00 11.13 5.47 -5.9 7.10 11.00 15.40 28.30 ▁▅▇▅▁
feelslikemax 0 1.00 13.98 7.24 -7.2 10.00 14.30 19.40 38.20 ▁▅▇▃▁
feelslikemin 0 1.00 6.04 6.36 -14.6 1.00 5.80 11.50 22.10 ▁▅▇▇▂
feelslike 0 1.00 9.97 6.67 -9.1 4.80 10.50 15.33 28.90 ▁▆▇▇▁
dew 0 1.00 7.33 4.67 -8.0 4.00 7.50 10.90 19.00 ▁▃▇▇▂
humidity 0 1.00 79.34 10.43 39.6 72.30 80.30 87.30 99.90 ▁▂▅▇▅
precip 0 1.00 1.64 4.75 0.0 0.00 0.19 1.38 168.20 ▇▁▁▁▁
precipprob 0 1.00 73.35 44.22 0.0 0.00 100.00 100.00 100.00 ▃▁▁▁▇
precipcover 0 1.00 8.67 10.97 0.0 0.00 8.33 12.50 100.00 ▇▁▁▁▁
snow 0 1.00 0.02 0.25 0.0 0.00 0.00 0.00 8.40 ▇▁▁▁▁
snowdepth 0 1.00 0.05 0.56 0.0 0.00 0.00 0.00 18.10 ▇▁▁▁▁
windgust 1126 0.78 42.50 14.80 9.4 31.30 41.40 52.08 120.90 ▅▇▃▁▁
windspeed 0 1.00 23.06 8.44 5.1 16.90 21.90 27.80 72.20 ▅▇▂▁▁
winddir 0 1.00 196.67 92.57 0.1 126.47 221.40 258.83 359.80 ▃▂▃▇▃
sealevelpressure 0 1.00 1014.86 10.55 966.6 1008.60 1015.80 1022.00 1047.80 ▁▁▇▇▁
cloudcover 0 1.00 60.74 22.23 0.0 46.70 62.70 77.53 100.00 ▁▃▇▇▅
visibility 0 1.00 22.93 10.67 0.2 15.50 22.40 29.10 62.50 ▃▇▆▁▁
solarradiation 0 1.00 113.71 85.28 0.0 41.18 93.60 172.52 358.60 ▇▅▃▂▁
solarenergy 0 1.00 9.81 7.37 0.0 3.50 8.10 14.90 31.00 ▇▅▃▂▁
uvindex 0 1.00 4.29 2.54 0.0 2.00 4.00 6.00 10.00 ▇▆▅▅▁
moonphase 0 1.00 0.48 0.29 0.0 0.25 0.49 0.75 0.98 ▇▇▇▇▇
year 0 1.00 2017.04 4.04 2010.0 2014.00 2017.00 2021.00 2024.00 ▆▇▇▇▆
month 0 1.00 6.52 3.46 1.0 4.00 7.00 10.00 12.00 ▇▅▅▅▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2010-07-30 00:00:00 2024-06-29 00:00:00 2017-07-14 12:00:00 5084
sunrise 0 1 2010-07-30 05:17:39 2024-06-29 04:46:46 2017-07-14 16:57:02 5084
sunset 0 1 2010-07-30 20:50:39 2024-06-29 21:21:22 2017-07-15 09:10:33 5084

Summary Statistics

Besides the number of bikes hired each day, we also have data on the weather for that day

Having loaded and cleaned our data, we can create summary statistics using mosaic’s favstats. This is uni-variate analysis, so in our mosaic syntax we will use favstats(~ bikes_hired, data= bike). We also want to get an overall, time-series plot of bikes over time; for the latter, we just create a scatter plot of bikes_hired vs Day.

favstats(~ bikes_hired, data= bike)
minQ1medianQ3maxmeansdnmissing
01.96e+042.6e+043.3e+047.31e+042.64e+049.59e+0350840

While this analysis shows us overall statistics, what if we wanted to get summary statistics by year, day_of_week, month, or season? Mosaic’s syntax Y ~ X alows us to facet our analysis of a variable Y by another variable X using the syntax favstats( Y ~ Z, data=...)

favstats(bikes_hired ~ year, data=bike)
yearminQ1medianQ3maxmeansdnmissing
20102.76e+039.3e+03 1.4e+04 1.87e+042.8e+04 1.41e+045.62e+031550
20114.56e+031.63e+042.03e+042.37e+042.94e+041.96e+045.5e+03 3650
20123.53e+031.93e+042.62e+043.25e+044.71e+042.6e+04 9.43e+033660
20133.73e+031.76e+042.2e+04 2.74e+043.56e+042.2e+04 7.28e+033650
20144.33e+032.05e+042.77e+043.44e+044.9e+04 2.75e+049.07e+033650
20155.78e+032.21e+042.66e+043.29e+047.31e+042.7e+04 8.55e+033650
20164.89e+032.24e+042.79e+043.51e+044.66e+042.82e+048.85e+033660
20175.14e+032.41e+042.95e+043.45e+044.6e+04 2.86e+048.38e+033650
20185.86e+032.18e+042.92e+043.77e+044.61e+042.9e+04 1.02e+043650
20195.65e+032.42e+042.89e+043.45e+044.47e+042.86e+048.09e+033650
20204.87e+032.01e+042.75e+043.65e+047.02e+042.85e+041.16e+043660
20216.25e+032.17e+043.1e+04 3.82e+045.69e+043e+04       1.1e+04 3650
20220       2.51e+043.14e+044e+04       6.7e+04 3.15e+041.03e+043650
20236.72e+031.99e+042.37e+042.78e+043.53e+042.34e+046.03e+033650
20247.48e+031.85e+042.29e+042.71e+043.52e+042.28e+045.91e+031810
favstats(bikes_hired ~ day_of_week, data=bike)
day_of_weekminQ1medianQ3maxmeansdnmissing
Sun0       1.41e+042.1e+04 2.99e+046.31e+042.23e+041.06e+047260
Mon3.97e+032.06e+042.58e+043.17e+046.7e+04 2.61e+048.37e+037260
Tue3.76e+032.24e+042.78e+043.46e+046.53e+042.82e+048.71e+037260
Wed4.33e+032.26e+042.79e+043.44e+045.44e+042.83e+048.65e+037260
Thu5.65e+032.25e+042.73e+043.46e+047.31e+042.83e+048.88e+037260
Fri5.4e+03 2.14e+042.65e+043.32e+046.7e+04 2.71e+048.64e+037270
Sat0       1.6e+04 2.25e+043.13e+047.02e+042.43e+041.12e+047270
favstats(bikes_hired ~ month_name, data=bike)
month_nameminQ1medianQ3maxmeansdnmissing
Jan3.73e+031.39e+041.9e+04 2.33e+043.8e+04 1.87e+045.99e+034340
Feb3.53e+031.6e+04 2.06e+042.46e+045.25e+042.03e+046.37e+033960
Mar5.06e+031.76e+042.31e+042.72e+045.66e+042.27e+047.59e+034340
Apr4.87e+032.1e+04 2.59e+043.08e+044.9e+04 2.6e+04 7.61e+034200
May1.07e+042.45e+042.99e+043.6e+04 7.02e+043.04e+048.44e+034340
Jun6.06e+032.77e+043.31e+043.92e+046.53e+043.34e+048.59e+034190
Jul5.56e+032.99e+043.69e+044.17e+047.31e+043.52e+048.37e+034050
Aug4.3e+03 2.55e+043.33e+043.89e+046.7e+04 3.16e+049.83e+034340
Sep0       2.51e+043.18e+043.65e+045.18e+043.06e+048.26e+034200
Oct7.07e+032.33e+042.8e+04 3.27e+044.79e+042.75e+046.9e+03 4340
Nov6.03e+031.87e+042.37e+042.8e+04 4.47e+042.32e+046.52e+034200
Dec2.76e+031.16e+041.66e+042.29e+043.91e+041.71e+046.94e+034340
favstats(bikes_hired ~ season_name, data=bike)
season_nameminQ1medianQ3maxmeansdnmissing
Winter2.76e+031.38e+041.88e+042.35e+045.25e+041.86e+046.58e+0312640
Spring4.87e+032.08e+042.61e+043.13e+047.02e+042.64e+048.51e+0312880
Summer4.3e+03 2.77e+043.43e+043.99e+047.31e+043.34e+049.08e+0312580
Autumn0       2.18e+042.74e+043.29e+045.18e+042.71e+047.86e+0312740

Exploratory Data Analysis

Time series plot of bikes rented

While summary statistics allow us to quickly disover key metrics that represent the data, they are often not sufficient by themselves. Often, it is useful to represent the data graphically (or visually) to uncover information that is critical for decision making, such as trends, patterns and outliers.

In this section we will create a time-series scatter-plot that shows the number of bikes hired on each day. We will use the ggplot2 library.

Creating a basic plot is a two-step process. The first step is the specify the data frame and the axes by using the syntax: ggplot(data frame, aes(x=variable, y=variable). The second step is to specify the plot that we want. In this case, we want a scatter (point) plot using geom_point() after a + sign.

ggplot(bike, aes(x=date, y=bikes_hired))+
  geom_smooth()+
  geom_point(alpha = 0.3)+
  theme_bw()+
  NULL
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Further graphs

# Histogram faceted by month_name in 4 rows
ggplot(bike, aes(x=bikes_hired))+
  geom_histogram()+
  facet_wrap(~month_name, nrow = 4)+
  theme_bw()+
  NULL
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Boxplot of bikes_hired by month_name
ggplot(bike, aes(x=month_name, y= bikes_hired))+
  geom_boxplot()+
  theme_bw()+
  NULL

# bikes_hired vs. weather features
ggplot(bike, aes(x=temp, y= bikes_hired))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  NULL
## `geom_smooth()` using formula = 'y ~ x'

ggplot(bike, aes(x=temp, y= bikes_hired, colour=season_name))+
  geom_point(alpha = 0.2)+
  geom_smooth(method = "lm", se=FALSE)+
  theme_bw()+
  #  facet_wrap(~season_name, ncol=1)+
  NULL
## `geom_smooth()` using formula = 'y ~ x'

# bikes on humidity
ggplot(bike, aes(x=humidity, y= bikes_hired))+
  geom_point()+
  geom_smooth(method = "lm")+
  theme_bw()+
  NULL
## `geom_smooth()` using formula = 'y ~ x'

Model building

Correlation - Scatterplot matrix

Besides the number of bikes rented out on a given day, we have also downloaded weather data for London such as temp, humidity, sealevelpressure, precip, etc. that measure weather conditions on a single day. It may be the case that more bikes are rented out when it’s warmer? Or how can we estimate the effect of rain on rentals?

Your task is to build a regression model that helps you explain the number of rentals per day.

Let us select a few of these numerical variables and create a scatterplot-correlation matrix

bike %>% 
  select(cloudcover, humidity, sealevelpressure, solarradiation, precip, solarradiation, temp, bikes_hired) %>% 
  GGally::ggpairs()

Weekend or weekdays? Is there a difference?

t.test(bikes_hired ~ weekend, data= bike)
## 
##  Welch Two Sample t-test
## 
## data:  bikes_hired by weekend
## t = 13.355, df = 2218.9, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  3665.004 4926.609
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            27597.91            23302.10

Model 0: using the mean to predict bikes_hired

We start the naive model where we just use the average to predict how many bikes we are going to rent out on a single day

favstats(~bikes_hired, data = bike)
minQ1medianQ3maxmeansdnmissing
01.96e+042.6e+043.3e+047.31e+042.64e+049.59e+0350840
# can you create a confidence interval for mean bikes_hired? What is the SE?

model0 <- lm(bikes_hired ~ 1, data= bike)

msummary(model0)
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  26370.2      134.5     196 <0.0000000000000002 ***
## 
## Residual standard error: 9594 on 5083 degrees of freedom
# plot actual data vs predicted data
broom::augment(model0) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

What is the regression’s residual standard error? What is the intercept standard error?

Model 1: bikes_hired on temp

# Define the model
model1 <- lm(bikes_hired ~ temp, data = bike)

# look at model estimated
msummary(model1)
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 14506.38     242.30   59.87 <0.0000000000000002 ***
## temp         1065.60      19.53   54.55 <0.0000000000000002 ***
## 
## Residual standard error: 7619 on 5082 degrees of freedom
## Multiple R-squared:  0.3693, Adjusted R-squared:  0.3692 
## F-statistic:  2976 on 1 and 5082 DF,  p-value: < 0.00000000000000022
# plot actual data vs predicted data
broom::augment(model1) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

  • Is the effect of temp significant? Why?
  • What proportion of the overall variability in bike rentals does temperature explain?

Model 2: bikes_hired on temp plus weekend

# Define the model
model2 <- lm(bikes_hired ~ temp + weekend, data = bike)

# look at model estimated
msummary(model2)
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  15734.9      243.6   64.59 <0.0000000000000002 ***
## temp          1064.5       18.9   56.31 <0.0000000000000002 ***
## weekendTRUE  -4254.4      228.9  -18.59 <0.0000000000000002 ***
## 
## Residual standard error: 7374 on 5081 degrees of freedom
## Multiple R-squared:  0.4095, Adjusted R-squared:  0.4093 
## F-statistic:  1762 on 2 and 5081 DF,  p-value: < 0.00000000000000022
# plot actual data vs predicted data
broom::augment(model2) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

  • Fit a regression model called model2 with the following explanatory variables: temp and weekend

    • Is the effect of temp significant? Why?
    • What proportion of the overall variability does this model explain? What is the meaning of the effect (slope) of weekendTRUE? What % of the variability of bikes_hired does your model explain?

Model 3: bikes_hired on temp plus wday

# Define the model
model3 <- lm(bikes_hired ~ temp + wday, data = bike)

# look at model estimated
msummary(model3)
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 15189.80     342.98  44.288 < 0.0000000000000002 ***
## temp         1063.56      18.77  56.650 < 0.0000000000000002 ***
## wdayMon      -859.42     384.25  -2.237              0.02535 *  
## wdaySat     -2716.08     384.11  -7.071     0.00000000000175 ***
## wdaySun     -4683.71     384.25 -12.189 < 0.0000000000000002 ***
## wdayThu      1199.99     384.25   3.123              0.00180 ** 
## wdayTue      1186.82     384.25   3.089              0.00202 ** 
## wdayWed      1249.24     384.25   3.251              0.00116 ** 
## 
## Residual standard error: 7323 on 5076 degrees of freedom
## Multiple R-squared:  0.4181, Adjusted R-squared:  0.4173 
## F-statistic:   521 on 7 and 5076 DF,  p-value: < 0.00000000000000022
# Diagnostics
autoplot(model3)

# plot actual data vs predicted data
broom::augment(model3) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

What is the meaning of the effect (slope) of, e.g., wdayMon? What % of the variability of bikes_hired does your model explain?

model4 <- lm(bikes_hired ~ temp + wday + humidity + factor(month), data = bike)

# look at model estimated
msummary(model4)
##                 Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     38655.89    1086.73  35.571 < 0.0000000000000002 ***
## temp              549.73      31.96  17.202 < 0.0000000000000002 ***
## wdayMon          -935.14     352.69  -2.651             0.008039 ** 
## wdaySat         -2659.04     352.55  -7.542 0.000000000000054446 ***
## wdaySun         -4792.36     352.69 -13.588 < 0.0000000000000002 ***
## wdayThu          1256.27     352.67   3.562             0.000371 ***
## wdayTue          1262.61     352.69   3.580             0.000347 ***
## wdayWed          1364.43     352.69   3.869             0.000111 ***
## humidity         -259.36      11.24 -23.079 < 0.0000000000000002 ***
## factor(month)2    414.96     468.66   0.885             0.375975    
## factor(month)3   1440.90     463.30   3.110             0.001881 ** 
## factor(month)4   1959.58     489.46   4.004 0.000063291875370202 ***
## factor(month)5   4374.43     522.28   8.376 < 0.0000000000000002 ***
## factor(month)6   5657.20     573.11   9.871 < 0.0000000000000002 ***
## factor(month)7   5748.44     617.19   9.314 < 0.0000000000000002 ***
## factor(month)8   2975.77     599.64   4.963 0.000000718429322077 ***
## factor(month)9   4523.38     555.88   8.137 0.000000000000000504 ***
## factor(month)10  4728.24     505.28   9.358 < 0.0000000000000002 ***
## factor(month)11  3554.63     470.98   7.547 0.000000000000052380 ***
## factor(month)12 -1474.35     458.34  -3.217             0.001305 ** 
## 
## Residual standard error: 6721 on 5064 degrees of freedom
## Multiple R-squared:  0.511,  Adjusted R-squared:  0.5092 
## F-statistic: 278.5 on 19 and 5064 DF,  p-value: < 0.00000000000000022
# plot actual data vs predicted data
broom::augment(model4) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

model5 <- lm(bikes_hired ~ temp + wday + tempmin + tempmax + feelslike + solarradiation +
               humidity + factor(month) + solarradiation + precip, data = bike)

# look at model estimated
msummary(model5)
##                  Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     30697.889   1382.125  22.211 < 0.0000000000000002 ***
## temp             -707.761    244.946  -2.889             0.003875 ** 
## wdayMon          -940.524    334.977  -2.808             0.005008 ** 
## wdaySat         -2629.407    334.918  -7.851  0.00000000000000501 ***
## wdaySun         -4835.004    335.154 -14.426 < 0.0000000000000002 ***
## wdayThu          1289.848    334.966   3.851             0.000119 ***
## wdayTue          1249.543    335.123   3.729             0.000195 ***
## wdayWed          1393.875    334.995   4.161  0.00003222801551785 ***
## tempmin          -398.679    100.523  -3.966  0.00007408336651012 ***
## tempmax           808.435    106.997   7.556  0.00000000000004916 ***
## feelslike         666.935    122.321   5.452  0.00000005207070115 ***
## solarradiation      2.924      1.855   1.577             0.114952    
## humidity         -165.448     12.165 -13.600 < 0.0000000000000002 ***
## factor(month)2    635.449    448.721   1.416             0.156798    
## factor(month)3    637.421    459.093   1.388             0.165065    
## factor(month)4    366.953    522.547   0.702             0.482562    
## factor(month)5   2828.429    578.623   4.888  0.00000104903019514 ***
## factor(month)6   4232.615    631.773   6.700  0.00000000002317766 ***
## factor(month)7   4736.883    666.070   7.112  0.00000000000130611 ***
## factor(month)8   2071.534    629.229   3.292             0.001001 ** 
## factor(month)9   3046.323    563.545   5.406  0.00000006753898883 ***
## factor(month)10  3790.875    498.740   7.601  0.00000000000003483 ***
## factor(month)11  3125.722    450.053   6.945  0.00000000000425592 ***
## factor(month)12 -1451.759    435.445  -3.334             0.000862 ***
## precip           -246.751     19.696 -12.528 < 0.0000000000000002 ***
## 
## Residual standard error: 6383 on 5059 degrees of freedom
## Multiple R-squared:  0.5594, Adjusted R-squared:  0.5573 
## F-statistic: 267.6 on 24 and 5059 DF,  p-value: < 0.00000000000000022
# plot actual data vs predicted data
broom::augment(model5) |> 
  mutate(day=row_number()) |> 
  ggplot()+
  aes(x=day, y = bikes_hired)+
  
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

vif(model5)
##                      GVIF Df GVIF^(1/(2*Df))
## temp           224.071749  1       14.969026
## wday             1.006055  6        1.000503
## tempmin         31.817179  1        5.640672
## tempmax         57.709590  1        7.596683
## feelslike       82.967682  1        9.108660
## solarradiation   3.121011  1        1.766638
## humidity         2.006571  1        1.416535
## factor(month)    8.074003 11        1.099591
## precip           1.092009  1        1.044992

Further variables/questions to explore on your own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis

  • Are other weather variables useful in explaining bikes_hired?
  • We also have data on days of the week, month of the year, etc. Could those be helpful?
  • What’s the best model you can come up with?
  • Is this a regression model to predict or explain? If we use it to predict, what’s the Residual SE?

Ploting your best model predicitons vs reality

Let us say that your best model is model3. How do the predictions of your model compare to the actual data?

# plot actual data vs predicted data

my_best_model <- broom::augment(model3) %>% 
  mutate(day=row_number())



# Plot fitted line and residuals
ggplot(my_best_model, aes(x=day, y = bikes_hired)) +
  geom_point(alpha = 0.2) +
  geom_point(aes(y = .fitted), 
             shape = 1, 
             colour = "red", alpha = 0.2)+
  theme_bw()

Diagnostics, collinearity, summary tables

As you keep building your models, it makes sense to:

  1. Check the residuals, using autoplot(model_x). You will always have some deviation from normality, especially for very high values of n
  2. As you start building models with more explanatory variables, make sure you use car::vif(model_x) to calculate the Variance Inflation Factor (VIF) for your predictors and determine whether you have colinear variables. A general guideline is that a VIF larger than 10 is large, and your model may suffer from collinearity. Remove the variable in question and run your model again without it.
# Check VIF

vif(model2)
##    temp weekend 
## 1.00001 1.00001
vif(model3)
##          GVIF Df GVIF^(1/(2*Df))
## temp 1.000077  1        1.000038
## wday 1.000077  6        1.000006
vif(model4)
##                   GVIF Df GVIF^(1/(2*Df))
## temp          3.440138  1        1.854761
## wday          1.001375  6        1.000115
## humidity      1.544302  1        1.242700
## factor(month) 3.928781 11        1.064172

Comparison of different models

Create a summary table, using huxtable (https://am01-sep23.netlify.app/example/modelling_side_by_side_tables/) that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error. If you want to add more models, just make sure you do not forget the comma , after the last model, as shown below

# produce summary table comparing models using huxtable::huxreg()
huxreg(model0, model1, model2, model3, model4,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05
) %>% 
  set_caption('Comparison of models')
Comparison of models
(1)(2)(3)(4)(5)
(Intercept)26370.172 ***14506.378 ***15734.845 ***15189.803 ***38655.890 ***
(134.549)   (242.303)   (243.625)   (342.975)   (1086.735)   
temp        1065.598 ***1064.468 ***1063.559 ***549.733 ***
        (19.533)   (18.903)   (18.774)   (31.958)   
weekendTRUE                -4254.351 ***                
                (228.899)                   
wdayMon                        -859.420 *  -935.139 ** 
                        (384.250)   (352.685)   
wdaySat                        -2716.075 ***-2659.042 ***
                        (384.114)   (352.555)   
wdaySun                        -4683.707 ***-4792.364 ***
                        (384.248)   (352.689)   
wdayThu                        1199.987 ** 1256.269 ***
                        (384.247)   (352.670)   
wdayTue                        1186.822 ** 1262.607 ***
                        (384.247)   (352.686)   
wdayWed                        1249.237 ** 1364.426 ***
                        (384.247)   (352.694)   
humidity                                -259.360 ***
                                (11.238)   
factor(month)2                                414.955    
                                (468.657)   
factor(month)3                                1440.898 ** 
                                (463.301)   
factor(month)4                                1959.576 ***
                                (489.460)   
factor(month)5                                4374.431 ***
                                (522.280)   
factor(month)6                                5657.200 ***
                                (573.105)   
factor(month)7                                5748.443 ***
                                (617.188)   
factor(month)8                                2975.771 ***
                                (599.642)   
factor(month)9                                4523.377 ***
                                (555.882)   
factor(month)10                                4728.243 ***
                                (505.279)   
factor(month)11                                3554.632 ***
                                (470.980)   
factor(month)12                                -1474.351 ** 
                                (458.336)   
#observations5084        5084        5084        5084        5084        
R squared0.000    0.369    0.409    0.418    0.511    
Adj. R Squared0.000    0.369    0.409    0.417    0.509    
Residual SE9593.667    7619.480    7373.692    7323.393    6721.367    
*** p < 0.001; ** p < 0.01; * p < 0.05.